
Conversation

@krishung5
Contributor

@krishung5 krishung5 commented Nov 7, 2025

Overview:

There are two separate but related fixes addressing different aspects of the memory leak/hang problem:

Issue 1: Race Condition in Finish Event Communication:

When vLLM finishes a request, it calls request_finished() on the leader. The leader and workers must coordinate to clean up GPU blocks, but there was a race condition where the leader would delete the slot before workers could process the finish event.

Time  | Leader Thread                    | Worker Thread
------|----------------------------------|----------------------------------
T1    | vLLM calls request_finished()    |
T2    | Mark slot as "Finishing"         |
      | (has 3 pending offload ops)      |
T3    | Remove slot from slot_manager    | 
      | Return to vLLM                   |
T4    |                                  | Receive finish event via metadata
T5    |                                  | Look for slot... NOT FOUND!
T6    |                                  | Skip cleanup (no slot)
T7    |                                  | Operations never complete
      |                                  | GPU blocks never freed
      |                                  | → MEMORY LEAK

Fix:
Keep Slots Alive in Finishing State - changes in leader.rs

  • Keeps the slot alive when it has pending operations (Finishing state)

Prevent New Operations After Finishing - changes in slot.rs

  • In offload_blocks, if the slot is already Finishing, don't create new operations.
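A minimal sketch of how these two changes fit together. The SlotState enum, Slot struct, and method names below are simplified stand-ins for illustration, not the actual leader.rs/slot.rs code:

```rust
// Sketch of the Issue 1 fix; these types are hypothetical stand-ins,
// not the real connector code.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SlotState {
    Active,
    Finishing, // finish requested, but operations are still pending
    Finished,  // all operations complete; safe to remove
}

struct Slot {
    state: SlotState,
    pending_ops: usize,
}

impl Slot {
    // Transition to Finishing if work is outstanding, Finished otherwise.
    fn mark_as_finished(&mut self) {
        self.state = if self.pending_ops > 0 {
            SlotState::Finishing
        } else {
            SlotState::Finished
        };
    }

    // Early-exit guard: refuse to create new offload operations once
    // the slot has started finishing.
    fn offload_blocks(&mut self) -> Result<(), String> {
        if matches!(self.state, SlotState::Finishing | SlotState::Finished) {
            return Ok(()); // no-op instead of racing with cleanup
        }
        self.pending_ops += 1;
        Ok(())
    }

    // Conditional removal: only drop the slot once it is fully Finished,
    // keeping Finishing slots alive so workers can drain pending ops.
    fn should_remove(&self) -> bool {
        self.state == SlotState::Finished
    }
}

fn main() {
    let mut slot = Slot { state: SlotState::Active, pending_ops: 3 };
    slot.mark_as_finished();
    assert_eq!(slot.state, SlotState::Finishing);
    assert!(!slot.should_remove()); // slot stays alive for workers
    slot.offload_blocks().unwrap();
    assert_eq!(slot.pending_ops, 3); // guard prevented a new operation
    println!("finishing-state guard ok");
}
```

The key invariant: a slot in Finishing refuses new work but stays registered until its pending operations drain, at which point it can transition to Finished and be removed.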

Issue 2: Failed offload operations didn't notify the scheduler

When an offload operation fails (e.g., "Not enough blocks available" due to CPU cache being full), the error is logged but no completion notification is sent to the scheduler. This leaves the scheduler's completed counter stuck, preventing slots from ever being marked as complete.

Time  | Leader Thread                    | Worker Thread
------|----------------------------------|----------------------------------
T1    | Offload operation created        |
T2    | Operation enqueued to workers    |
T3    |                                  | Worker starts processing
T4    |                                  | Operation fails:
      |                                  | "Not enough blocks available"
T5    |                                  | Error logged
      |                                  | NO completion notification!
T6    |                                  | Scheduler: completed=2, total=3
T7    |                                  | is_complete() returns false
T8    |                                  | Slot never becomes complete
T9    |                                  | GPU blocks never freed
      |                                  | → MEMORY LEAK

Fix:
Notify scheduler that this operation is "complete" - changes in slot.rs

  • Create a fake/immediate transfer request that completes instantly
  • This increments the workers' completed counter so they can progress
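A rough sketch of this pattern, with a plain mpsc channel standing in for the real BlockTransferRequest/ImmediateTransferCompletionHandle machinery (the struct and function names here are hypothetical):

```rust
use std::sync::mpsc;

// Sketch of the Issue 2 fix: a failed offload still emits a completion
// message so the scheduler's completed counter keeps advancing.

#[derive(Debug)]
struct Completion {
    request_id: u64,
    operation_id: u64,
}

struct Scheduler {
    completed: usize,
    total: usize,
}

impl Scheduler {
    fn is_complete(&self) -> bool {
        self.completed == self.total
    }
}

fn process_offload(
    fails: bool,
    request_id: u64,
    operation_id: u64,
    tx: &mpsc::Sender<Completion>,
) {
    if fails {
        // Error is logged, then a synthetic completion is emitted so the
        // scheduler's counter still advances and the slot can finish.
        eprintln!("offload failed: not enough blocks available");
        let _ = tx.send(Completion { request_id, operation_id });
        return;
    }
    // ... a real transfer would run here, then notify on success ...
    let _ = tx.send(Completion { request_id, operation_id });
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let mut sched = Scheduler { completed: 2, total: 3 };

    // The third operation fails, but still produces a completion.
    process_offload(true, 42, 3, &tx);
    let done = rx.recv().unwrap();
    assert_eq!(done.request_id, 42);
    assert_eq!(done.operation_id, 3);

    sched.completed += 1;
    assert!(sched.is_complete()); // slot can now be marked complete
    println!("synthetic completion ok");
}
```

Because the failure path now emits the same completion message as the success path, the completed counter always reaches total and is_complete() can eventually return true.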

Closes DIS-848

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved request completion handling with proper state tracking and synchronization during the finalization process
    • Enhanced error handling in block transfer operations to ensure completion notifications are properly propagated even when issues occur
    • Added guards to safely handle operations during finalization stages
    • Strengthened type safety for internal transfer mechanisms

@krishung5
Contributor Author

/ok to test 6097426

@krishung5 krishung5 force-pushed the krish/kvbm-mem-leak branch from e2e3b9f to 1b820f7 Compare November 7, 2025 15:30
@krishung5 krishung5 marked this pull request as ready for review November 7, 2025 15:58
@krishung5 krishung5 requested a review from a team as a code owner November 7, 2025 15:58
@coderabbitai
Contributor

coderabbitai bot commented Nov 7, 2025

Walkthrough

The changes enhance slot lifecycle management in the vLLM block manager connector. Updates include acquiring locks before marking slots finished, conditional slot removal based on state (Finished vs. Finishing), early-exit guards during finalization, explicit typing for transfer channels, and synthetic completion notifications on offload failures to ensure progress in error scenarios.

Changes

Cohort: Slot Request Completion and Lock Safety
File: lib/kvbm/src/block_manager/vllm/connector/leader.rs
request_finished now acquires the slot lock before marking the slot finished via mark_as_finished. Slot removal is now conditional: the slot is removed only when its state is Finished, and kept alive when Finishing so pending operations can drain. Clarifying comments were added on locking and state transitions.

Cohort: Slot State Guards and Transfer Improvements
File: lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
Added an early-exit guard in offload_blocks that no-ops when the slot is Finishing or Finished. Transfer channels now use explicit types (LocalOnboardRequest, LocalOffloadRequest). On offload failure, a synthetic completion notification is emitted with the captured request_id and operation_id to signal progress to the scheduler.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Lock ordering and deadlock potential: Verify that the new lock acquisition in request_finished does not introduce deadlock scenarios with existing locking patterns.
  • State transition semantics: Ensure the Finishing vs. Finished distinction correctly prevents operations on finalizing slots while preserving pending work.
  • Synthetic request construction: Validate that the fake LocalOffloadRequest and BlockTransferRequest are constructed correctly and that captured identifiers flow through notification pathways without loss.
  • Error path coverage: Confirm that the synthetic completion mechanism handles all offload failure scenarios and that scheduler counters advance correctly.

Poem

🐰 A slot's lifecycle, now locked up tight,
With states that finish, or hold the light,
No more hasty removals in the fray,
Synthetic signals save the day!
Transfer types bloom where channels align,
Our vLLM connector feels divine! ✨

Pre-merge checks

✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title concisely describes the main fix: addressing a GPU memory leak in KVBM, which is the primary objective of the PR.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
Description check | ✅ Passed | The pull request description provides a comprehensive overview, detailed explanations of two separate issues with timing diagrams, specific fixes for each issue, and identifies key files for review. All required template sections are addressed with substantive content.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (1)

992-996: Early-exit guard prevents operations during finalization.

The guard correctly prevents new operations from being created when the slot is Finishing or Finished, addressing the race condition described in Issue 1. However, consider logging a debug message when this path is taken to aid observability and debugging, since callers may not be aware the operation was skipped.

Consider adding a debug log:

 // Check if slot is in Finishing state before creating operations
 // If we're finishing, don't create new operations
 if matches!(self.state, SlotState::Finishing | SlotState::Finished) {
+    tracing::debug!(
+        request_id = %self.request_id,
+        "skipping offload_blocks because slot is {:?}",
+        self.state
+    );
     return Ok(());
 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f509493 and 1b820f7.

📒 Files selected for processing (2)
  • lib/kvbm/src/block_manager/vllm/connector/leader.rs (1 hunks)
  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-03T19:31:32.621Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.

Applied to files:

  • lib/kvbm/src/block_manager/vllm/connector/leader.rs
📚 Learning: 2025-09-18T21:41:02.263Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:59-60
Timestamp: 2025-09-18T21:41:02.263Z
Learning: The codebase has a robust two-layer transfer management system: TransferBatcher (offload/pending.rs:400-485) handles batching large transfers into MAX_TRANSFER_BATCH_SIZE chunks, and LocalTransferManager (offload/pending.rs:280-286) limits concurrency to MAX_CONCURRENT_TRANSFERS using FuturesUnordered.

Applied to files:

  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Applied to files:

  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
📚 Learning: 2025-06-02T19:37:27.666Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.

Applied to files:

  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
📚 Learning: 2025-05-29T06:20:12.901Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Applied to files:

  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
📚 Learning: 2025-06-04T18:43:04.566Z
Learnt from: jthomson04
Repo: ai-dynamo/dynamo PR: 1363
File: lib/llm/src/block_manager/block/transfer.rs:206-216
Timestamp: 2025-06-04T18:43:04.566Z
Learning: For NIXL transfers in the KVBM system, the future returned by `nixl::write_blocks_to` is independent of the underlying transfer execution. The transfer begins immediately when `nixl::write_blocks_to` is called, and the returned future is only used for notification/completion tracking. Therefore, it's safe to drop the future when notification is not needed (`notify == false`).

Applied to files:

  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
🧬 Code graph analysis (1)
lib/kvbm/src/block_manager/vllm/connector/leader.rs (1)
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (4)
  • request_id (94-94)
  • request_id (431-433)
  • state (96-96)
  • state (435-437)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (3)

1182-1183: LGTM: Explicit channel typing improves clarity.

The explicit typing of transfer channels enhances code readability and type safety without altering functionality.


1222-1223: LGTM: Captures IDs for synthetic completion notification.

Correctly captures request_id and operation_id before the request is moved, enabling the synthetic completion pathway on failure.


1234-1258: Synthetic completion notification with empty blocks is safe and correctly addresses the failure scenario.

The implementation properly handles offload processing failures:

  • Empty BlockTransferRequest with RequestType::Immediate completes instantly through the completion channel
  • ImmediateTransferCompletionHandle sends the completion notification regardless of block count, ensuring the scheduler's completed counter increments even on failure
  • The Drop impl (protocol.rs:276-288) enforces that mark_complete() must be called, preventing silent completion tracking failures
  • Silent error handling (lines 1255-1257) is acceptable since the error is already logged above

The solution correctly ensures scheduler progress even when offload processing fails.

lib/kvbm/src/block_manager/vllm/connector/leader.rs (2)

510-518: LGTM: Lock acquisition prevents race condition.

Acquiring the lock before marking the slot as finished (line 518) is critical to preventing the race condition described in Issue 1. This ensures that the state transition to Finishing happens atomically with respect to any operation creation attempts in offload_blocks(). Combined with the early-exit guard added to slot.rs (lines 992-996), this prevents workers from missing the finish event because the slot was removed before they could process pending operations.


527-538: Code is correct - no changes needed.

The implementation properly handles the Finishing state and prevents memory leaks:

  1. Worker's is_complete() checks that all operations are complete by comparing completed count to total operations
  2. The remove_slot() method includes a defensive assertion requiring slot.is_complete() to be true before removal
  3. Leader extracts pending operations via take_pending_operations() and adds them to the scheduler so they complete before is_complete() returns true
  4. Worker panics if a slot unexpectedly disappears while in the maybe_finished set, catching protocol violations

The conditional removal logic correctly differentiates between Finishing (operations pending) and Finished (all resources released), ensuring blocks remain available while workers process pending operations.

Contributor

@ziqifan617 ziqifan617 left a comment


awesome work!

Comment on lines 1235 to 1237
// Notify scheduler that this operation is "complete" (even though it failed)
// Create a fake/immediate transfer request that completes instantly
// This increments the workers' completed counter so they can progress
Contributor


Suggested change
// Notify scheduler that this operation is "complete" (even though it failed)
// Create a fake/immediate transfer request that completes instantly
// This increments the workers' completed counter so they can progress
// Create a fake/immediate transfer request that completes instantly.
// Otherwise, the worker side might get stuck and cause a memory leak.

Comment on lines +516 to +518
// Mark the slot as finished (sets state to Finishing if there are operations,
// or Finished if all operations are complete)
slot.mark_as_finished(self.iteration_counter)?;
Contributor


Hmm, looking through the code, I'm wondering where the slot would ever be marked as SlotState::Finished. I only see places that set it to SlotState::Finishing? Will follow up offline.

Contributor Author


Thanks for the catch. Resolved offline. Fixed in the latest commit
